# Multi-Classification of Dry Beans Using Machine Learning

## Libraries
library(knitr)
library(dplyr)
library(ggplot2)
library(viridis)
library(hrbrthemes)
library(corrplot)
library(class)
library(caret)
library(rpart)
library(maptree)
library(pca3d)
library(MASS)
library(tibble)
library(randomForest)
library(e1071)
library(kernlab)
library(kableExtra)
library(gridExtra)
library(reshape)
library(GGally)
library(readr)
library(readxl)
library(float)
library(patchwork)
library(ggeasy)
theme_set(theme_classic())
#load data
labeled <- read.csv('labeled.csv') %>% dplyr::select(-X)
sampA <- read.csv('samp.A.csv')%>% dplyr::select(-X)
sampB <- read.csv('samp.B.csv')%>% dplyr::select(-X)
sampC <- read.csv('samp.C.csv')%>% dplyr::select(-X)
#convert Class into factor
labeled$Class <- as.factor(labeled$Class)

#create a new variable 'Roundness'
#Roundness = 4*pi*Area/Perimeter^2 (refer to the dry bean paper)
roundness <- 4*pi*labeled$Area/(labeled$Perimeter)^2
labeled <- add_column(labeled, Roundness = roundness, .after = 7)
sampA$Roundness <- 4*pi*sampA$Area/(sampA$Perimeter)^2
sampB$Roundness <- 4*pi*sampB$Area/(sampB$Perimeter)^2
sampC$Roundness <- 4*pi*sampC$Area/(sampC$Perimeter)^2

#check for duplicate rows
dup.rows = sum(labeled%>%duplicated(), sampA%>%duplicated(),
               sampB%>%duplicated(),sampC%>%duplicated())

## Introduction and Problem Statement

Dry bean (Phaseolus vulgaris L.) is a major cultivated grain species in the genus Phaseolus, widely consumed worldwide for its edible seeds and pods (Heuze et al., 2015). Nevertheless, selecting the best seed variety is one of the main concerns for both bean producers and the market. Since different genotypes are cultivated worldwide, it is important to separate the best seed variety from a mixed dry bean population; otherwise, the market value of these mixed beans could drop enormously (Varankaya & Ceyhan, 2012). For more background on the problem and the data, please refer to the associated article: Koklu, M., & Ozkan, I. A. (2020). Multiclass classification of dry beans using computer vision and machine learning techniques. Computers and Electronics in Agriculture, 174, 105507.

## Aim of the Project

The aim of this project is to develop a machine learning method for multi-class classification of dry beans, which could be used to predict the market worth of bean varieties harvested from a single farm's mixed-population cultivation.

## Exploratory Data Analysis

There are two datasets used in this project: the 'labeled' and 'unlabeled' datasets. The labeled dataset is used to train the various ML models and is hence also referred to as the training dataset. It contains 3000 observations and 8 variables. The dependent variable, Class, has 6 levels: BOMBAY, CALI, DERMASON, HOROZ, SEKER, and SIRA, representing the different varieties of dry beans. In the training set, each class has 500 observations.
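The claimed 500-per-class balance can be verified with a simple count. A minimal sketch, using a synthetic stand-in for the labeled data (the real CSV is loaded above and is not reproduced here):

```r
# Synthetic stand-in for the labeled data: 6 classes x 500 rows each
classes <- c("BOMBAY", "CALI", "DERMASON", "HOROZ", "SEKER", "SIRA")
labeled_demo <- data.frame(Class = factor(rep(classes, each = 500)))

# Count observations per class; a balanced design avoids majority-class bias
counts <- table(labeled_demo$Class)
print(counts)
stopifnot(all(counts == 500))
```

Running `table()` on the real `labeled$Class` in the same way confirms the balance before any model is fitted.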

The unlabeled dataset is made up of three separate samples: Sample A, B, and C. These samples have 7 variables (excluding the dependent variable, Class; hence 'unlabeled'). The total numbers of observations for Samples A, B, and C are 777, 1373, and 982, respectively. Roundness, a measure of how closely the shape of a bean approaches a perfect circle, was calculated and added as an additional predictor variable to both the labeled and unlabeled datasets (Koklu & Ozkan, 2020). Tables 1 through 4 show the summary statistics of the variables in the labeled data and Samples A, B, and C, respectively.

## Summary Statistics

summary.stats <- round(as.data.frame((labeled[,-9])%>%
                                       psych::describe())%>%
                         dplyr::select(n,mean, sd, median, min, max, range, se), 2)

kbl(summary.stats, caption="Statistical distribution of features of dry bean varieties (in pixels) - labeled data")%>%
  kable_classic(full_width = F, html_font = "Cambria", font_size = 12)
Statistical distribution of features of dry bean varieties (in pixels) - labeled data

| Feature | n | mean | sd | median | min | max | range | se |
|---|---|---|---|---|---|---|---|---|
| Area | 3000 | 69874.98 | 49578.52 | 48714.50 | 20645.00 | 251320.00 | 230675.00 | 905.18 |
| Perimeter | 3000 | 1012.24 | 347.75 | 941.90 | 384.17 | 2164.10 | 1779.93 | 6.35 |
| MajorAxisLength | 3000 | 362.05 | 124.52 | 332.90 | 161.52 | 740.97 | 579.45 | 2.27 |
| MinorAxisLength | 3000 | 225.19 | 73.35 | 202.73 | 106.00 | 473.39 | 367.39 | 1.34 |
| Eccentricity | 3000 | 0.76 | 0.10 | 0.77 | 0.30 | 0.94 | 0.64 | 0.00 |
| ConvexArea | 3000 | 70944.12 | 50382.27 | 50807.50 | 8912.00 | 259965.00 | 251053.00 | 919.85 |
| Extent | 3000 | 0.75 | 0.05 | 0.77 | 0.57 | 0.85 | 0.28 | 0.00 |
| Roundness | 3000 | 0.84 | 0.29 | 0.77 | 0.39 | 2.06 | 1.66 | 0.01 |

The variables Area and ConvexArea had the largest ranges in all four datasets. Because of these large differences in range, variables with larger ranges would dominate those with smaller ranges in distance calculations, which could bias the results. It is therefore necessary to scale the variables before fitting the distance-based models (i.e., KNN and SVM).
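The scaling step can be sketched with base R's `scale()`, which standardizes each column to mean 0 and standard deviation 1 so no feature dominates the distance metric (the values below are toy numbers, not the actual bean data):

```r
# Toy data frame with wildly different scales (Area ~ 10^4-10^5, Extent ~ 1)
df <- data.frame(Area   = c(20645, 69875, 251320),
                 Extent = c(0.57, 0.75, 0.85))

# Standardize each column to mean 0, sd 1
df_sc <- as.data.frame(scale(df))
round(colMeans(df_sc), 10)  # all ~0
apply(df_sc, 2, sd)         # all 1
```

The same transformation is applied later to the full labeled data when `labeled.sc` is constructed.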

## Variance Check

The variance of each variable by class shows evidence of non-constant variance across classes (Tables 5 & 6). Given the approximately normal distributions and the non-constant variance, we expect the QDA model to perform well.
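One formal check of the equal-variance assumption is Bartlett's test applied per predictor. A minimal sketch on synthetic data (two hypothetical classes with clearly different spreads; these are not the bean measurements):

```r
set.seed(1)
# Two synthetic classes with clearly different spreads
demo <- data.frame(
  Class = factor(rep(c("A", "B"), each = 100)),
  Area  = c(rnorm(100, sd = 1), rnorm(100, sd = 5))
)

# Bartlett's test: a small p-value rejects equal variances across classes,
# which favours QDA (per-class covariances) over LDA (pooled covariance)
bt <- bartlett.test(Area ~ Class, data = demo)
print(bt$p.value)
```

On the real data the same call per predictor would quantify what the by-class variance table shows informally.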

var.tab1 <- labeled%>%group_by(Class)%>%
  summarize(Var.Area=var(Area), Var.Perimeter=var(Perimeter),
            var.Maj.Axis.=var(MajorAxisLength), var.Min.Axis.=var(MinorAxisLength),
            var.Eccentricity=var(Eccentricity), var.ConvexArea=var(ConvexArea),
            var.Extent=var(Extent), var.Roundness=var(Roundness))


kbl(var.tab1, caption = "Variance of distribution")%>%
  kable_classic(full_width = F, html_font = "Cambria", font_size = 12)
Variance of distribution

| Class | Var.Area | Var.Perimeter | var.Maj.Axis. | var.Min.Axis. | var.Eccentricity | var.ConvexArea | var.Extent | var.Roundness |
|---|---|---|---|---|---|---|---|---|
| BOMBAY | 552697974 | 32015.80 | 3177.8992 | 826.6840 | 0.5472634 | 259965 | 0.8502428 | 1.262811 |
| CALI | 91528410 | 26272.79 | 1188.1780 | 491.4581 | 0.6183656 | 117510 | 0.8427527 | 1.512446 |
| DERMASON | 24651963 | 22913.81 | 696.7121 | 498.7684 | 0.5494947 | 56174 | 0.8471957 | 2.055745 |
| HOROZ | 56765885 | 24960.58 | 1252.4894 | 456.5644 | 0.7227374 | 82462 | 0.8420894 | 1.620064 |
| SEKER | 22567179 | 24170.41 | 736.8507 | 419.4165 | 0.3006355 | 65674 | 0.8183099 | 1.990205 |
| SIRA | 22641401 | 23977.06 | 782.4715 | 401.8085 | 0.6098838 | 73945 | 0.8418021 | 1.718602 |

## Histograms and Violin Plots

The histograms of the labeled data (Figure 1) show evidence of multimodality in the variables, meaning that at least one class of beans is very distinct from the others. Upon further checks, I found that this multimodality is caused by the BOMBAY bean type in both the labeled set and Sample A, but not in Samples B and C. This implies that there may be very few BOMBAY predictions for Samples B and C.

The violin plots of the labeled data show that the BOMBAY and CALI beans are very distinct from the other beans. Judging from the average points, Roundness and Extent seem to be strong predictors of SEKER, and Eccentricity seems to be a good predictor of HOROZ. The violin plots for each class (Figure 5) show that most of the class distributions are approximately normal, except for the distributions of Roundness and Extent. From these distributions, we expect BOMBAY and CALI to be predicted easily by our models.

theme_set(theme_classic())

a <- labeled%>%ggplot() + 
  geom_histogram(aes(x=Area/1000), fill="brown", col="white") + 
  labs(title = "Histogram  of Area") + xlab("Area(in 1000s)")# + ggeasy::easy_center_title()
    
b <- labeled%>%ggplot() + 
  stat_summary(aes(x=Class, y=Area/10000), col="brown")  + 
  labs(title = "Violin plot of Class vs Area") + 
  geom_violin(aes(x=Class, y=Area/10000), alpha=0.4) + ylab("Area")
  #ggeasy::easy_center_title()


c <- labeled%>%ggplot() +
  geom_histogram(aes(x=Perimeter/10), fill="brown", col="white")+xlab("Peri")
    
d <- labeled%>%ggplot() + 
  stat_summary(aes(x=Class, y=Perimeter/10), col="brown")   + 
  geom_violin(aes(x=Class, y=Perimeter/10), alpha=0.4) + ylab("Peri")




g <- labeled%>%ggplot() +
  geom_histogram(aes(x=MinorAxisLength), fill="brown", col="white") + xlab("Min.Axis.Length")
    
h <- labeled%>%ggplot() + 
  stat_summary(aes(x=Class, y=MinorAxisLength), col="brown")   + 
  geom_violin(aes(x=Class, y=MinorAxisLength), alpha=0.4) + ylab("Min.Axis.Length")


i <- labeled%>%ggplot() +
  geom_histogram(aes(x=ConvexArea), fill="brown", col="white") + xlab("Co.Area")
    
j <- labeled%>%ggplot() + 
  stat_summary(aes(x=Class, y=ConvexArea), col="brown")   + 
  geom_violin(aes(x=Class, y=ConvexArea), alpha=0.4) + ylab("Co.Area")



k <- labeled%>%ggplot() +
  geom_histogram(aes(x=Eccentricity), fill="brown", col="white") + xlab("Eccen")
    
l <- labeled%>%ggplot() + 
  stat_summary(aes(x=Class, y=Eccentricity), col="brown")   + 
  geom_violin(aes(x=Class, y=Eccentricity), alpha=0.4) + ylab("Eccen")


m <- labeled%>%ggplot() +
  geom_histogram(aes(x=Extent), fill="brown", col="white") + xlab("Extent")
    
n <- labeled%>%ggplot() + 
  stat_summary(aes(x=Class, y=Extent), col="brown")   + 
  geom_violin(aes(x=Class, y=Extent), alpha=0.4) + ylab("Extent")




img <- ((a + b) + plot_layout(widths = c(1, 2)))/
  ((c+d)+ plot_layout(widths = c(1, 2)))/ 
  ((g+h)+ plot_layout(widths = c(1, 2)))/
  ((i+j)+ plot_layout(widths = c(1, 2)))/
  ((k+l)+ plot_layout(widths = c(1, 2)))/
  ((m+n)+ plot_layout(widths = c(1, 2)))
 # plot_annotation(tag_levels = "A",tag_suffix = ")")

ggsave(filename = "explo1.png",
 width = 10, height = 8,
 dpi = 700)  

Figure 1: Exploratory Analysis of Labeled Data Set

## Correlation of Variables

Most of the variables except for Eccentricity, Extent, and Roundness, are highly correlated (Figure 6) in each dataset. This behavior is also seen in the correlation of the variables by classes (Figure 7).
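As a quick check, highly correlated predictor pairs can also be flagged programmatically from the correlation matrix. A minimal base-R sketch on synthetic data (the variable names here are illustrative, not the bean features):

```r
set.seed(2)
# Synthetic predictors: x2 is nearly a rescaled copy of x1, x3 is independent
x1 <- rnorm(200)
x2 <- 2 * x1 + rnorm(200, sd = 0.05)
x3 <- rnorm(200)
cm <- cor(cbind(x1, x2, x3))

# Flag off-diagonal pairs with |r| above a cutoff (0.9 here)
high <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
print(high)  # only the x1-x2 pair is flagged
```

Applied to `cor(labeled %>% dplyr::select(-Class))`, the same cutoff would list the size-related pairs (Area, ConvexArea, the axis lengths, Perimeter) that the ellipse plot shows visually.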

library(ellipse)
library(RColorBrewer)

# Build a panel of 100 colors with RColorBrewer
my_colors <- brewer.pal(3, "Accent")
my_colors <- colorRampPalette(my_colors)(100)


corrplot(cor(labeled%>% dplyr::select(-Class)), method = 'ellipse', type = "lower", outline = TRUE, tl.col = "brown", col=my_colors)
Correlation plot

## Principal Component Analysis

The principal component analysis (Figure 8) indicates that the first three principal components, which are new variables constructed as linear combinations of the original variables, explain more than 90% of the total variance in the dataset.

#####pca#####
library(factoextra)
pca.labeled <- prcomp(labeled %>% dplyr::select(-Class), scale = TRUE)
pca.sampA <- prcomp(sampA, scale = TRUE)
pca.sampB <- prcomp(sampB, scale = TRUE)
pca.sampC <- prcomp(sampC, scale = TRUE)

# cumulative proportion of variance explained (note: this name masks base::cumsum)
cumsum <- data.frame(PC = factor(paste0("PC", 1:8), levels = paste0("PC", 1:8)),
                     cumsum = cumsum(pca.labeled$sdev^2 / sum(pca.labeled$sdev^2)))


pc <- fviz_eig(pca.labeled, 
         choice = c("variance"), 
         ggtheme = theme_classic(), 
         geom = c("bar"), 
         barfill = "brown", 
         barcolor = "black",  
         main = "Principal Component Plot",
         xlab = "Principal Components")

cum <- ggplot(cumsum, aes(x=PC, y=cumsum)) + 
  geom_point(col = "brown", size=2) + 
  geom_hline(yintercept=0.9, linetype="dashed", color = "brown", size=1) +
  labs(y="Cum. Var. Explained",
  x ="Principal Components",
  title="Cumulative Variance Explained")


(pc + cum) + plot_layout(widths = c(1, 2))
Principal Component Analysis

## Construct labeled.sc and PCA datasets

#construct scaled labeled data
labeled.sc <- as.data.frame(scale(labeled %>% dplyr::select(-Class)))
labeled.sc$Class <- labeled$Class

#construct pca label data
labeled.pca <- as.data.frame(pca.labeled$x)
labeled.pca$Class <- labeled$Class

Accuracy measures the ratio of correctly classified samples to the total number of samples; among the candidate models, the one with the highest accuracy is considered the best.
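As a sketch, accuracy can be computed directly from a confusion matrix as the sum of the diagonal (correct classifications) over the total count. The labels below are toy examples, not model output:

```r
# Accuracy = correctly classified / total = trace of the confusion matrix
accuracy <- function(pred, truth) {
  cm <- table(pred, truth)
  sum(diag(cm)) / sum(cm)
}

pred  <- factor(c("SEKER", "SIRA", "SIRA", "CALI"))
truth <- factor(c("SEKER", "SIRA", "CALI", "CALI"))
accuracy(pred, truth)  # 3 of 4 correct -> 0.75
```

This is the same quantity reported as `overall[1]` by `caret::confusionMatrix` for the KNN, QDA, and random forest fits below.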

#ctrl <- trainControl(method = "LOOCV")
#
#knn.train <- train(
#  Class ~.,
#  data = labeled,
#  method = "knn",
#  tuneGrid = data.frame(k=1:20),
#  trControl = ctrl,
#  ## Center and scale the predictors for the training
#  ## set and all future samples.
#  preProc = c("scale")
#)
#ggplot(plsFit) + labs(y="Accuracy(LOOCV)",
#  x ="Numb of Neighbors",
#  title="Best Number of Neighbors")

## KNN 


set.seed(12345)
AR <- numeric(20)
for (k in 1:20) {
  # LOOCV KNN on the scaled predictors for each candidate k
  pred <- knn.cv(labeled.sc[,1:8], cl = labeled.sc$Class, k = k)
  AR[k] <- mean(pred == labeled.sc$Class)
}
AC <- tibble(k = 1:20, AR = AR)

k <- which(AR == max(AR)) # my result is 15/17



ggplot(AC) + 
  geom_line(aes(x=k, y=AR), color = "brown", size=1) +
  geom_point(aes(x=k, y=AR), color = "black", size=3) + 
  geom_vline(xintercept = min(k), linetype="dashed", color = "brown", size=1)+
  labs(y="Accuracy(LOOCV)",
  x ="Numb of Neighbors",
  title="Best Number of Neighbors")

knn <- knn.cv(labeled.sc[,1:8],
                     cl=labeled.sc$Class, k=min(k))


# predictions first, then the reference (true) labels
conf <- confusionMatrix(data = knn, reference = labeled$Class)

knn.acc <- conf$overall[1]

## QDA

#ctrl <- trainControl(method = "LOOCV")
#
#qda <- train(
#  Class ~.,
#  data = labeled,
#  method = "qda",
#  #tuneGrid = data.frame(k=1:20),
#  trControl = ctrl,
#  ## Center and scale the predictors for the training
#  ## set and all future samples.
#  #preProc = c("scale")
#)

qda <- qda(Class~., data = labeled, CV = TRUE)


qda.conf <- confusionMatrix(data = qda$class, reference = labeled$Class)


qda.acc <- qda.conf$overall[1]

## Random Forest

#ctrl <- trainControl(method = "LOOCV", number = 1)
#
#rf <- train(
#  Class ~.,
#  data = labeled,
#  method = "rf",
#  tuneGrid = data.frame(mtry=1:(ncol(data)-1)),
#  trControl = ctrl,
#  ## Center and scale the predictors for the training
#  ## set and all future samples.
#  #preProc = c("scale")
#)
#ggplot(rf)

# tune mtry over all predictors using the out-of-bag (OOB) error rate
set.seed(12345)
n <- ncol(labeled) - 1
errRate <- numeric(n)
for (i in 1:n){  
  m <- randomForest(Class~., data=labeled, mtry=i)  
  # average OOB error over the trees grown
  errRate[i] <- mean(m$err.rate[, "OOB"])  
}  
#a= which.min(errRate) 




rf_sum <- data.frame(mtry = 1:8, error_rate = errRate, acc = 1 - errRate)

# my result is 2
mtry_best <- ggplot(rf_sum, aes(x=mtry, y=acc)) + 
  geom_vline(xintercept = which.min(rf_sum$error_rate)  , linetype="dashed", color = "brown", size=1)+
  geom_line(color = "brown", size=1)+
  geom_point(color = "black", size=3) +
  labs(y="Accuracy(LOOCV)",
  x ="Numb of Variables",
  title="Opt Num of Variables")
  
(mtry_best + ~plot(m, main = 'Bagging Error')) + plot_layout(widths = c(1, 2))

#confusionMatrix(labeled$Class, m$predicted)

rff <- randomForest(Class~., data = labeled, mtry=2)

rf.conf <- confusionMatrix(data = rff$predicted, reference = labeled$Class)

rf.acc <- rf.conf$overall[1]

#ggsave(filename = "rf.model.png",
# width = 10, height = 7,
# dpi = 700) 
acc.data <- data.frame(cbind(Model=c("KNN", "QDA", "RF"),
                             rbind(table(knn), table(qda$class), table(rff$predicted)),
                             Accuracy=round(rbind(knn.acc, qda.acc, rf.acc),3)),
                       row.names = NULL)
#acc.data
model.plot <- ggplot(acc.data, aes(x=Model, y=Accuracy)) + geom_col(width = .60, aes(fill=Model), show.legend = F)


(wrap_elements(gridExtra::tableGrob(acc.data)) / model.plot ) + plot_layout(widths = c(1, 2)) + plot_annotation(
 title = "Model Predictions(TPR+TNR) and Accuracy")

# qda$class is already a factor with the bean-type labels,
# so no numeric recoding is needed
pred.label.dat <- data.frame(class = qda$class,
                             Eccentricity = labeled$Eccentricity,
                             Extent = labeled$Extent)
grid.arrange(
ggplot(labeled)+geom_point(aes(x=Extent, y=Eccentricity,col=Class), size=.5)+ labs(title = "True Labeled")+ 
  theme(legend.position="bottom"),
ggplot(pred.label.dat)+geom_point(aes(x=Extent, y=Eccentricity,col=class), size=.5)+ labs(title = "LOOCV qda Labeled") + theme(legend.position="bottom"), ncol=2)
Final selected model (QDA)

## Prediction on Unlabeled Samples

sample_data <- data.frame(rbind(sampA, sampB, sampC))

qda <- qda(Class~., data = labeled, CV = FALSE)
qda_predict <- predict(qda, newdata = sample_data)

table(qda_predict$class)
## 
##   BOMBAY     CALI DERMASON    HOROZ    SEKER     SIRA 
##       23      462      952      567      591      536
pred.samp.dat <- data.frame(class = qda_predict$class, sample_data)
grid.arrange(
ggplot(labeled)+geom_point(aes(x=Extent, y=Eccentricity,col=Class), size=.5)+ labs(title = "True Labeled")+ 
  theme(legend.position="bottom"),
ggplot(pred.samp.dat)+geom_point(aes(x=Extent, y=Eccentricity,col=class), size=.5)+ labs(title = "QDA Predicted (Samples A, B, C)") + theme(legend.position="bottom"), ncol=2)